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Fermi's Sibyl: Mining the gamma-ray sky for dark matter 
subhaloes 
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ABSTRACT 

Dark matter annihilation signals coming from Galactic subhaloes may account for 
a small fraction of unassociated point sources detected in the Second Fermi-LAT 
catalogue (2FGL). To investigate this possibility, we present Sibyl, a Random Forest 
classifier that offers predictions on class memberships for unassociated Fermi-LAT 
sources at high Galactic latitudes using gamma-ray features extracted from the 2FGL. 
Sibyl generates a large ensemble of classification trees that are trained to vote on 
whether a particular object is an active galactic nucleus (AGN) or a pulsar. After 
training on a list of 908 identified/associated 2FGL sources, Sibyl reaches individual 
accuracy rates of up to 97.7% for AGNs and 96.5% for pulsars. Predictions for the 
269 unassociated 2FGL sources at \b\ ^ 10° suggest that 216 are potential AGNs and 
16 are potential pulsars (with majority votes greater than 70%). The remaining 37 
objects are inconclusive, but none is an extreme outlier. These results could guide 
future quests for dark matter Galactic subhaloes. 

Key words: (cosmology:) dark matter - gamma-rays: observations - galaxies: active 
- (stars:) pulsars: general 



1 INTRODUCTION 

The extraordinary success of the Fermi mission marks the 
beginning of the golden age for gamma-ray astrophysics. 
With 24 months of data, the Second Fermi-LAT catalogue 
• (2FGL) lists 1873 sources in the 100 MeV to 100 GeV energy 
range, of which 886 are AGNs and 108 are pulsars. While 
Fermi has greatly mitigated issues inherent to source local- 
isation in the gamma-ray regime, 269 sources in the 2FGL 
(15% of the total) remain without obvious counterparts at 
Galactic latitude |6| ^ 10°. Such failure to associate the en- 
tire Fermi catalogue continues to fuel speculation about the 
existence of new types of gamma-ray source classes. 

Probably the most intriguing potential sources 
of gamma ray em i ssion are dark m atter subhaloes 
l|Diemand et al.ll2008l : fSpringel et al.ll2008l ). Numerical cold 
dark matter (CDM) simulations suggest that galaxies like 
our own are surrounded by a wealth of small dark matter 
subhaloes that survived structure formation l|Klypin et al.l 
ll999l ; lMoore et al.|[l999l) . Massive subhaloes (M > 1O 7 M ) 
would correspond to "classical" dwarf galaxies. Less mas- 
sive ones would be optically elusive and might only be re- 
vealed as gamma-ray point sources when weakly interact- 
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ing massive particles (WIMP s) annihilate to gamma rays 
(Kuhl en. Madau fc S ilk 2009). As a result, nearby dark mat- 
ter subhaloes might be lurking among the unassociated 
Fermi sources at high Galactic latitudes. If found, an an- 
nihilation signal from Galactic subhaloes would clinch the 
first non-gravitational signature of dark matter. 

The hunt for dark matter subh aloes in the Fermi 
catalogue is currently u nderw ay dNieto et al.l 
Belikov, Hooper fc Buckley! l201ll : Zechlin et al 



2011 



2011 



Ackermann et al. 2012bl ). Most approaches involve the 



hypothesised shar p spectral cut-off or step expected at 
the WIMP mass (|Bergstrom et all 120051 ). Assuming that 
the WIMP mass falls between 100 MeV and 50 GeV, a 
dark matter subhalo could be detectable in the Fermi 
MeV-GeV band, but would disappear in the GeV-TeV 
band, effectively creating a TeV dropout. 

Here we investigate the possibility of identifying dark 
matter subhalo candidates using supervised machine learn- 
ing algorithms. Rather than starting with an ad hoc theo- 
retical dark matter spectrum we would like to exploit pat- 
tern recognition of known gamma-ray features in associated 
sources and use this information to locate outliers that might 
constitute novel emitters. Machine learning algorithms have 
already been used to stu dy the First Fermi LAT Cata- 
logue (1FGL). For example. I Ackermann et all (|2012al ) inves- 
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Figure 1. Gamma-ray features ranked in order of importance. 
MeanDecreaseAccuracy measures the difference between accuracy 
rates before and after permutation of individual features averaged 
over all trees. Higher percentages indicate more importance. 



Figure 2. Properties of Fermi features plotted against each other. 
Top features include index, curve, variability, and 1st and 2nd 
scaling coordinates (1 and 2 respectively) generated by Sibyl. Two 
distinct classes are clear: AGNs (red) and pulsars (green). 



tigated classification trees and logistic regression to predict 
classes of unassociated sources in the 1FGL based on a set of 
gamma-ray features. K-means clustering was also applied to 
help disti nguish individual counterparts within Fermi error 
contours {Mirabal. Nieto. fe Pardcll20ld ). 

With an additional year of collected Fermi data, the 
gamma-ray features reported in the 2FGL have improved 
subst antially. In this paper we train the Random Forest clas- 
sifier (Brcirnan 2001) on identified/associated Fermi objects 
and build a set of decision trees that provide predictions 
for high-latitude unassociated Fermi objects in the 2FGL. 
The paper is organised as follows. In section [2] we describe 
the datasets and the Random Forest algorithm. Section [3] 
describes the performance of the classifier on unassociated 
Fermi sources. Section [4] details the search for potential out- 
liers. Finally, we provide our conclusions and discuss future 
work in section [5] 



2 DATASETS AND RANDOM FORESTS 

Random Forest is an ensem ble classifier th at grows a large 
forest of classification trees l)Breimanll200l] ). Decision trees 
are classification tools that have a tree structure, where 
each split is based on the inform ation gained conside ring 
the elements of the feature space {Quintan et al.lll994h . To 
classify a new object, each tree in the forest votes on the 
class. The proportion of votes P that agree on a decision 
provides a measure of the accuracy of the classification. 
Random Forest then makes a prediction based on the ma- 
jority of votes (P > 0.5). Random Forest also computes 
proximities between pairs of objects and produces scaling 
coordinates (1 st and 2nd) that can be used to visualise 
datasets easilv (|Cox fc Coxll200ll ). In addition, a number of 
comparisons have shown that Random Forest is unexcelled 
in accuracy among current cla s sifiers (jSvetnik et alJ 120031 : 
iQi et al.|[200fj ; ISoto et ai]|201ll ; iRichards et al l 120121 ). The 



analysis presented he re uses the R randomForest package 
{Liaw fc Wienerll2002l ). 

For our dataset, we collected the complete Fermi LAT 
2FGL catalogue that consists of 1873 sources (100 MeV- 
100 GeV) of which 1300 are firmly i dentified/associated 
and 573 are unassociated sources (|Abdo et all |2012| ; 
lAckermann et al]|201ll ). In total, we consider a list that in- 
cludes 800 labelled AGNs (BL Lacs and flat-spectrum radio 
quasars only) and 108 pulsars. There are additional gamma- 
ray classes in the 2FGL, but AGNs and pulsars are the 
largest and most common at |6| ^ 10°. Thereby we simply 
consider a bimodality of classes. Novelty detection will be 
discussed later on. For each of the 908 sources a total of 68 
features are reported in the 2FGL. Features include Galac- 
tic latitude, Galactic longitude, spectral index (Index), vari- 
ability, curvature index (Curve), and fluxes in five bands. In 
addition, we generate four derived features defined by flux 
ratios FRij = FluXi/FluXj between consecutive bands for 
0.1-0.3 GeV (Band 1), 0.3-1 GeV (Band 2), 1-3 GeV (Band 
3), 3-10 GeV (Band 4), and 10-100 GeV (Band 5) compa- 
rable t o the features first introduced by lAckermann et all 
(|2012ah . 

To avoid working with too many features that could 
generate noise in the classifier, we first identify the sub- 
set of features that best discriminates what constitutes an 
AGN or a pulsar. For that purpose, we compute the rele- 
vance of each feature towards the target class, rank them 
by importance, and apply the classifier to a subset of the 
most relevant ones. Specifically, we use the measure of im- 
portance - Me anDecreasedAccuracy - implemen ted within 
randomForest {Breimanll200ll ; ISvetnik et ai1l2004l ) . Initially, 
the accuracy rate is computed for each tree as the Random 
Forest is constructed. The value of a particular feature is 
then permuted across all the objects while other features are 
left unchanged, and the accuracy rate is recorded again. The 
MeanDecreaseAccuracy is the overall percentage decrease in 
accuracy rate averaged over all trees. If the feature is impor- 
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Table 1. Predictions and voting percentages for unassociatcd Fermi sources in the 2FGL, ordered by RA 



Source 




Pagn 


f Pulsar 


Prediction 


2FGL 


J0004. 


94-99DX 





.974 





.026 


AGN 


2FGL 


T001 A 


o uouy 


n 

u. 


QQ9 


n 


. uuo 


AGN 


2FGL 


J0031. 


0+0724 





.946 





.054 


AGN 


2FGL 


J0032. 


7-5521 





.998 





.002 


AGN 


2FGL 


J0039. 


1+4331 





.776 





.224 


AGN 


2FGL 


J0048. 


8-6347 





.922 





.078 


AGN 


2FGL 


J0102. 


2+0943 





.998 





.002 


AGN 


2FGL 


J0103. 


8+1324 





.998 





.002 


AGN 


2FGL 


J0106. 


5+4854 





.406 





.594 




2FGL 


J0116. 


6-6153 





.992 





.008 


AGN 


2FGL 


J0124. 


6-2322 





.998 





.002 


AGN 


2FGL 


J0129. 


4+2618 





.820 





.180 


AGN 


2FGL 


J0133. 


4-4408 





.968 





.032 


AGN 


2FGL 


J0143. 


6-5844 


1 


.000 





.000 


AGN 


2FGL 


J0158. 


4+0107 





.990 





.010 


AGN 



Note: The complete list of predictions is available at http://www.gae.ucm.es/~mirabal/sibyl.html 



tant, there should be a greater decrease in the accuracy rate 
compared to the initial one. Figure Q] shows the top most 
important features ranked by importance. We found that 
the features that most clearly differentiate AGN and pulsar 
classes include: Index, Curve, Variability, and Flux Ratios 
(FR12, FR23, FR34, and FR45). This sel ection is in general 
agreement with lAckermann et alJ ()2012al ) who chose similar 
features for supervised classification of the 1FGL. Additional 
features showed considerably smaller values in their impor- 
tance (MeanDecreaseAccuracy) and are thus not considered 
in the analysis. 

In order to construct and train we start with 

the 800 AGNs and 108 pulsars. However, given the highly 
unbalanced nature of the sets, we replicate t he pulsar sample 
to attain a c loser size as the AGN class dLing fc Lilll998l : 
IChawla|[2005h . Practically, the content of the datasets have 
not changed but the replication mechanism adds weight to 
the minority sample and achieves improved performance in 
the classifier. 

After matching the AGNs and pulsar datasets, we gen- 
erate 100 alternate training and testing sets built from ran- 
domly selected objects (2/3 and 1/3 of the sample respec- 
tively). We next produce Random Forest models with 500 
trees for each training set. For validation, individual per- 
formance is evaluated at each of the 100 testing sets. Ac- 
curacy rates are computed directly by comparing the class 
predicted by Sibyl with the true class for each object in 
the testing sets. On average, Sibyl achieves an accuracy rate 
of 97.1% based on majority voting (97.7% for AGNs and 
96.5% for pulsars). Inclusion of absolute Galactic latitude 
1 6 1 in the classifier lowered AGN and pulsar accuracy rates 
slightly to 97.4% to 95.5% respectively. Since pulsars tend 
to be situated along the Galactic plane and AGN are more 
numerous at high Galactic latitude, it is possible that us- 
ing Galactic latitude as a feature could introduce a tiny bias 
ag ainst AGN near the Gala ctic plane and pulsars away from 
it l|Ackermann et aL 2012a). Generally, most of the misclas- 
sifications occur when less than 70% of the individual trees 



1 In ancient Greece, a sibyl was a person or agency considered to 
be a source of predictions or oracles. 



(P < 0.7) agree on a classification. Figure [5] displays the 
outstanding separation between AGNs and pulsars, which 
explains the high accuracy rates obtained by Sibyl. 



3 APPLICATION TO UNASSOCIATED 
SOURCES 

The designation of 2FGL sources usually falls into three 
categories: identified, associated, and unassociated. A firm 
identification of a gamma-ray source can only be estab- 
lished through contemporaneous temporal variability, sim- 
ilar spatial morphology, or equivalent pulsation at other 
wavelengths. An association only requires positional corre- 
spondence of a possible counterpart with a 2FGL source. 
Unassociated sources lack a formal counterpart at other 
wavelengths. 

Here, we consider a fourth category to designate 2FGL 
sources: "prediction". Our objective is twofold: to predict 
the class of high-latitude unassociated Fermi objects in the 
2FGL; and to produce a list of outliers that could be ex- 
plored as potential dark matter subhaloes. For each of the 
269 unassociated Fermi sources at ^ 10°, Sibyl provides 
a prediction that the object is an AGN (Pagn) or a pulsar 
(Ppuisar) based on individual votes polled from all trees in 
the decision forest. 

Since we want to isolate outliers that might constitute 
dark matter subhalo candidates, we only accept Sibyl predic- 
tions whenever P > 0.7 i.e., at least 70% of the trees agree 
on the final decision. Otherwise, the object remains without 
a prediction. Such threshold value is set based on the results 
explained in Section [5] In total, Sibyl predicts 216 objects 
to be AGN and 16 to be pulsars. The resulting predictions 
and percentages of voting agreements are listed in Table Q] 
Finally, the remaining 37 objects left without a firm predic- 
tion are the focus of our outlier study in the next section. It 
is important to note that under some specific circumstances, 
dark matter subh aloes could mimic the spectral properties 
of cer tain pulsars jBaltz. Taylor fc Wai"ll2007l ; IZechlin et al.1 
l201ll) . we discuss this possibility further in [5] 
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the 37 objects without firm predictions. We find that the top 
five outliers have an average flux of 1.1 x 10" 9 ph cm s 
(1-100 GeV). Unassociated source fluxes at high latitudes 
range from 7.7 x 10~ 9 to 1.1 x 10~ 10 ph cm~ 2 s" 1 . Thus, 
they are not necessarily the faintest sources in the dataset. 
On the other hand, the mean photon index of sources in Ta- 
ble [2] is 2.2 ± 0.3, while photon indices in the unassociated 
sample range from 1.1 to about 3.0. Inspection of individ- 
ual features in this manner yields limited insight into what 
makes these outliers stand out from the rest of the sample. 
As mentioned before, the exploration of the entire feature 
space is precisely where the supervised learning algorithm 
excels. Unfortunately, Sibyl cannot assess by itself whether 
the outlyingness is due to an anomaly in the data taking 
process, a simple variation within known Fermi classes, or a 
true novel source class such as dark matter annihilation in 
Galactic subhaloes. 



Figure 3. Distribution of outlyingness for the 37 objects without 
firm predictions (shaded circles) and the 232 predicted by Sibyl 
(open circles). The top outliers are summarised in Table [2] 



Table 2. Top outliers among high-latitude unassociated sources 
in the 2FGL 



Source 


Pagjv 


Outlyingness 


2FGL J0953.6-1504 


0.658 


9.0 


2FGL J0418.9+6636 


0.574 


7.2 


2FGL J1710.0-0323 


0.500 


7.1 


2FGL J0533.9+6759 


0.336 


6.6 


2FGL J0336. 0+7504 


0.476 


6.2 



4 SEARCH FOR DARK MATTER 
SUBHALOES IN THE 2FGL 

In order to better understand the nature of the remaining 
37 objects we want compute their outlyingness, which is a 
measure of how far away an object is from its closest class. 
Apart for predicting an object's class, Random Forest com- 
putes the proximity of each predicted Fermi object n to 
every element k within each class ^2 eclasB prox(n, k). For- 
mally, each individual proximity prox(n, k) is computed as 
the fraction of trees in which elements n and k fall in th e 
same terminal node (|Breimanll2"00ll :l Liaw fe Wiened 120021) . 
The outlyingness of an element n is calculated as the recip- 
rocal sum of the squared proximities to all objects within 
its class. This outlying measure is normalised by subtract- 
ing the med ian and dividing by th e absolute deviation from 
the median (|Liaw fc Wienerl l2002). Larger outlyingness val- 
ues are common in objects that are extremely different from 
the average, which could correspond to dark matter sub- 
haloes. Figure [3] shows the distribution of outlyingness for 
the 37 objects without firm predictions. For comparison, we 
also plot the outlyingness for the remaining 232 objects that 
were predicted by Sibyl in the previous section. Additionally, 
Table [5] lists the five objects with the largest outlyingness. 

Given that outlyingness values much gr eater than 10 
usually indicate novel cases l|Breimanl 1200 ll ) , there is no 
strong indication of novelties (significant outliers) among 



5 DISCUSSION AND CONCLUSIONS 

We have presented the outcome of the Random Forest pre- 
dictor Sibyl. The results show that machine learning algo- 
rithms provide a reasonable route not only to predict unas- 
sociated AGNs/pulsars in the 2FGL, but also to produce a 
list of sources with unusual features that could be explored 
as potential dark matter subhalo candidates. After training 
on 908 identified/associated Fermi objects, Sibyl has been 
applied to predict the class of unassociated Fermi sources 
in the 2FGL. Out of 269 unassociated sources at high lat- 
itudes, we have found that 216 are AGN candidates and 
16 are considered potential pulsars with prediction accuracy 
rates greater than 96.5%. Sibyl has also produced a list of 
37 outlier objects; however, none of these exhibits significant 
outlyingness that can be directly connected to new gamma- 
ray classes (including dark matter subhaloes) at this point. 
We emphasise again that our results are strict predictions 
based on pattern recognition and thus a rigourous source 
identification process will have to localise actual counter- 
parts at other wavelengths. 

The results leave some room, albeit very small, to ac- 
commodate dark matter subhaloes or alternative source 
classes in the 2FGL. These pockets could be targeted to 
exhaust all possibilities. Looking forward, zooming in on 
a reduced group of sources might be a wise observational 
strategy. For obvious reasons, the set of objects with the 
largest outlyingness could be a reasonable place to conduct 
a dedicated survey. If d ark matter consists of p articles with 
a mass below 60 GeV (|Hooper fc Lindenll201ll ). dark mat- 
ter subhaloes might also be camouflaging among the ranks 
of predicted pulsars as their spectral signature could be sim- 
ilar to the pronounced spectral cut-off predicted predicted 
by certain dark matter models. However, a number of these 
sources could be old radio-quiet pulsars which wil l compli- 
cate the search for a counterpart (Ker r et al.ll2012T ). 

There are a number of issues that need further ex- 
ploration. For instance, the predictions are heavily depen- 
dent on the robustness of the spectral parameters listed in 
the 2FGL. Most machine learning algorithms lack a proper 
treat ment of uncertainties in each of the fe atures consid- 
ered (|Carroll et al.ll2006l ; iMorgan et al.ll2012l ). Inclusion of 
uncertainties as individual features in Sibyl did not yield 
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improved performances in our predictions. With additional 
years of flight, Fermi will likely keep improving the accu- 
racy of the gamma-ray features. However, attempts should 
be made to account for feature errors properly. In a forth- 
coming paper, we plan to explo re a more refined bre akdown 
into further Fermi subclasses (|Hassan et al.l I2012T ). There 
are at least four AGN subclasses in the 2FGL comprising 
BL Lacs, flat-sp ectrum radio quasars, m isaligned AGNs, and 
Seyfert galaxies (|Ackermann et all2 011'l. Pulsars can be fur- 
ther par titioned into radio -loud, radio-quiet, and millisecond 
pulsars (|Abdo et alj|2010h . 

Ultimately, the main reason that a large Fermi fraction 
remains unassociated to begin with has to do with the qual- 
ity of localisations in the gamma-ray band. At faint flux 
levels, it becomes ever more difficult to associate a Fermi 
source with a particular counterpart. The best association 
procedures rely on positional coin cidences and correlation s 
with flat-spectrum radio sources (Ac kermann et all 1201 if ) . 
None the less, considering the results presented here and 
the scatter in gamma-ray flux it seems likely that many 
of the unassociated sources at high latitude are AGNs or 
mid-latitude pulsars with somewhat fainter radio fluxes than 
their brighter cousins. 

Without a major breakthrough in localisations, the ac- 
tual counterparts of most unassociated Fermi objects will 
be difficult to pinpoint in the short term. Machine learning 
algorithms can help narrow the options. Eventually, we will 
see significant improvement in localisations, particularly for 
Galactic sources, courtesy of the future Cherenkov Telescope 
Array (CTA) t hat will achieve enhance d angular resolution 
above 25 GeV l|CTA Consortiumll201lf ). 
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